O conjunto de dados é composto por clientes da empresa Ifood com dados sobre:
* Perfis de clientes
* Preferências do produto
* Sucessos/fracassos da campanha
* Desempenho do canal
import pandas as pd
import seaborn as sns #manipular gráfico
import numpy as np #operações numpy
import matplotlib.pyplot as plt #manipular gráfico
import plotly.express as px
pd.options.display.max_columns
20
#pd.options.display.max_columns = None
df = pd.read_csv("mkt_data.csv")
df
| Unnamed: 0 | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | ... | education_Graduation | education_Master | education_PhD | MntTotal | MntRegularProds | AcceptedCmpOverall | marital_status | education_level | kids | expenses | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 58138.0 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | 88 | ... | 3.0 | NaN | NaN | 1529 | 1441 | 0 | Single | Graduation | 0 | 1529 |
| 1 | 1 | 46344.0 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | 1 | ... | 3.0 | NaN | NaN | 21 | 15 | 0 | Single | Graduation | 2 | 21 |
| 2 | 2 | 71613.0 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | 21 | ... | 3.0 | NaN | NaN | 734 | 692 | 0 | Together | Graduation | 0 | 734 |
| 3 | 3 | 26646.0 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | 3 | ... | 3.0 | NaN | NaN | 48 | 43 | 0 | Together | Graduation | 1 | 48 |
| 4 | 4 | 58293.0 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | 27 | ... | NaN | NaN | 5.0 | 407 | 392 | 0 | Married | PhD | 1 | 407 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2200 | 2200 | 61223.0 | 0 | 1 | 46 | 709 | 43 | 182 | 42 | 118 | ... | 3.0 | NaN | NaN | 1094 | 847 | 0 | Married | Graduation | 1 | 1094 |
| 2201 | 2201 | 64014.0 | 2 | 1 | 56 | 406 | 0 | 30 | 0 | 0 | ... | NaN | NaN | 5.0 | 436 | 428 | 1 | Together | PhD | 3 | 436 |
| 2202 | 2202 | 56981.0 | 0 | 0 | 91 | 908 | 48 | 217 | 32 | 12 | ... | 3.0 | NaN | NaN | 1217 | 1193 | 1 | Divorced | Graduation | 0 | 1217 |
| 2203 | 2203 | 69245.0 | 0 | 1 | 8 | 428 | 30 | 214 | 80 | 30 | ... | NaN | 4.0 | NaN | 782 | 721 | 0 | Together | Master | 1 | 782 |
| 2204 | 2204 | 52869.0 | 1 | 1 | 40 | 84 | 3 | 61 | 2 | 1 | ... | NaN | NaN | 5.0 | 151 | 130 | 0 | Married | PhD | 2 | 151 |
2205 rows × 44 columns
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2205 entries, 0 to 2204 Data columns (total 44 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 2205 non-null int64 1 Income 2205 non-null float64 2 Kidhome 2205 non-null int64 3 Teenhome 2205 non-null int64 4 Recency 2205 non-null int64 5 MntWines 2205 non-null int64 6 MntFruits 2205 non-null int64 7 MntMeatProducts 2205 non-null int64 8 MntFishProducts 2205 non-null int64 9 MntSweetProducts 2205 non-null int64 10 MntGoldProds 2205 non-null int64 11 NumDealsPurchases 2205 non-null int64 12 NumWebPurchases 2205 non-null int64 13 NumCatalogPurchases 2205 non-null int64 14 NumStorePurchases 2205 non-null int64 15 NumWebVisitsMonth 2205 non-null int64 16 AcceptedCmp3 2205 non-null int64 17 AcceptedCmp4 2205 non-null int64 18 AcceptedCmp5 2205 non-null int64 19 AcceptedCmp1 2205 non-null int64 20 AcceptedCmp2 2205 non-null int64 21 Complain 2205 non-null int64 22 Z_CostContact 2205 non-null int64 23 Z_Revenue 2205 non-null int64 24 Response 2205 non-null int64 25 Age 2205 non-null int64 26 Customer_Days 2205 non-null int64 27 marital_Divorced 230 non-null float64 28 marital_Married 854 non-null float64 29 marital_Single 477 non-null float64 30 marital_Together 568 non-null float64 31 marital_Widow 76 non-null float64 32 education_2n Cycle 198 non-null float64 33 education_Basic 54 non-null float64 34 education_Graduation 1113 non-null float64 35 education_Master 364 non-null float64 36 education_PhD 476 non-null float64 37 MntTotal 2205 non-null int64 38 MntRegularProds 2205 non-null int64 39 AcceptedCmpOverall 2205 non-null int64 40 marital_status 2205 non-null object 41 education_level 2205 non-null object 42 kids 2205 non-null int64 43 expenses 2205 non-null int64 dtypes: float64(11), int64(31), object(2) memory usage: 758.1+ KB
1. Quantos dados temos? Linhas e colunas
df.shape
(2205, 44)
2. Quais são as colunas numéricas?
df_numeric = df.select_dtypes('int64','float64')
df_numeric.columns
Index(['Unnamed: 0', 'Kidhome', 'Teenhome', 'Recency', 'MntWines', 'MntFruits',
'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue', 'Response',
'Age', 'Customer_Days', 'MntTotal', 'MntRegularProds',
'AcceptedCmpOverall', 'kids', 'expenses'],
dtype='object')
3. Temos duplicados na nossa base? Se tivermos, retire-os
df.duplicated().sum()
0
Não há duplicados.
4. Temos dados nulos nessa base? Será que eles indicam algo? O que fazer com eles?
df.isnull().sum()
Unnamed: 0 0 Income 0 Kidhome 0 Teenhome 0 Recency 0 MntWines 0 MntFruits 0 MntMeatProducts 0 MntFishProducts 0 MntSweetProducts 0 MntGoldProds 0 NumDealsPurchases 0 NumWebPurchases 0 NumCatalogPurchases 0 NumStorePurchases 0 NumWebVisitsMonth 0 AcceptedCmp3 0 AcceptedCmp4 0 AcceptedCmp5 0 AcceptedCmp1 0 AcceptedCmp2 0 Complain 0 Z_CostContact 0 Z_Revenue 0 Response 0 Age 0 Customer_Days 0 marital_Divorced 1975 marital_Married 1351 marital_Single 1728 marital_Together 1637 marital_Widow 2129 education_2n Cycle 2007 education_Basic 2151 education_Graduation 1092 education_Master 1841 education_PhD 1729 MntTotal 0 MntRegularProds 0 AcceptedCmpOverall 0 marital_status 0 education_level 0 kids 0 expenses 0 dtype: int64
As colunas que apresentam valores nulos, aparentam estar corretas, uma vez que nem todas as pessoas possuem graduação e pós graduação ou são casadas, divorciadas, entre outras atribuições.
colunas_nulas = ['marital_Divorced',
'marital_Married',
'marital_Single',
'marital_Together',
'marital_Widow',
'education_2n Cycle',
'education_Basic',
'education_Graduation',
'education_Master',
'education_PhD']
type(colunas_nulas)
list
df_nulos = df[colunas_nulas]
df_nulos
| marital_Divorced | marital_Married | marital_Single | marital_Together | marital_Widow | education_2n Cycle | education_Basic | education_Graduation | education_Master | education_PhD | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN |
| 1 | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN |
| 2 | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | 3.0 | NaN | NaN |
| 3 | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | 3.0 | NaN | NaN |
| 4 | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2200 | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN |
| 2201 | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | NaN | NaN | 5.0 |
| 2202 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | 3.0 | NaN | NaN |
| 2203 | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN | NaN | 4.0 | NaN |
| 2204 | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 5.0 |
2205 rows × 10 columns
df[colunas_nulas] = np.where(df[colunas_nulas].isnull(),0,1)
df
| Unnamed: 0 | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | ... | education_Graduation | education_Master | education_PhD | MntTotal | MntRegularProds | AcceptedCmpOverall | marital_status | education_level | kids | expenses | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 58138.0 | 0 | 0 | 58 | 635 | 88 | 546 | 172 | 88 | ... | 1 | 0 | 0 | 1529 | 1441 | 0 | Single | Graduation | 0 | 1529 |
| 1 | 1 | 46344.0 | 1 | 1 | 38 | 11 | 1 | 6 | 2 | 1 | ... | 1 | 0 | 0 | 21 | 15 | 0 | Single | Graduation | 2 | 21 |
| 2 | 2 | 71613.0 | 0 | 0 | 26 | 426 | 49 | 127 | 111 | 21 | ... | 1 | 0 | 0 | 734 | 692 | 0 | Together | Graduation | 0 | 734 |
| 3 | 3 | 26646.0 | 1 | 0 | 26 | 11 | 4 | 20 | 10 | 3 | ... | 1 | 0 | 0 | 48 | 43 | 0 | Together | Graduation | 1 | 48 |
| 4 | 4 | 58293.0 | 1 | 0 | 94 | 173 | 43 | 118 | 46 | 27 | ... | 0 | 0 | 1 | 407 | 392 | 0 | Married | PhD | 1 | 407 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2200 | 2200 | 61223.0 | 0 | 1 | 46 | 709 | 43 | 182 | 42 | 118 | ... | 1 | 0 | 0 | 1094 | 847 | 0 | Married | Graduation | 1 | 1094 |
| 2201 | 2201 | 64014.0 | 2 | 1 | 56 | 406 | 0 | 30 | 0 | 0 | ... | 0 | 0 | 1 | 436 | 428 | 1 | Together | PhD | 3 | 436 |
| 2202 | 2202 | 56981.0 | 0 | 0 | 91 | 908 | 48 | 217 | 32 | 12 | ... | 1 | 0 | 0 | 1217 | 1193 | 1 | Divorced | Graduation | 0 | 1217 |
| 2203 | 2203 | 69245.0 | 0 | 1 | 8 | 428 | 30 | 214 | 80 | 30 | ... | 0 | 1 | 0 | 782 | 721 | 0 | Together | Master | 1 | 782 |
| 2204 | 2204 | 52869.0 | 1 | 1 | 40 | 84 | 3 | 61 | 2 | 1 | ... | 0 | 0 | 1 | 151 | 130 | 0 | Married | PhD | 2 | 151 |
2205 rows × 44 columns
5. Qual é a média, mediana, 25 percentil, 75 percentil, mínimo e máximo de cada uma das colunas numéricas?
df.describe()
| Unnamed: 0 | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | ... | education_2n Cycle | education_Basic | education_Graduation | education_Master | education_PhD | MntTotal | MntRegularProds | AcceptedCmpOverall | kids | expenses | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2205.000000 | 2205.000000 | 2205.000000 | 2205.000000 | 2205.000000 | 2205.000000 | 2205.000000 | 2205.000000 | 2205.000000 | 2205.000000 | ... | 2205.000000 | 2205.000000 | 2205.000000 | 2205.000000 | 2205.000000 | 2205.000000 | 2205.000000 | 2205.00000 | 2205.000000 | 2205.000000 |
| mean | 1102.000000 | 51622.094785 | 0.442177 | 0.506576 | 49.009070 | 306.164626 | 26.403175 | 165.312018 | 37.756463 | 27.128345 | ... | 0.089796 | 0.024490 | 0.504762 | 0.165079 | 0.215873 | 562.764626 | 518.707483 | 0.29932 | 0.948753 | 562.764626 |
| std | 636.672993 | 20713.063826 | 0.537132 | 0.544380 | 28.932111 | 337.493839 | 39.784484 | 217.784507 | 54.824635 | 41.130468 | ... | 0.285954 | 0.154599 | 0.500091 | 0.371336 | 0.411520 | 575.936911 | 553.847248 | 0.68044 | 0.749231 | 575.936911 |
| min | 0.000000 | 1730.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 4.000000 | -283.000000 | 0.00000 | 0.000000 | 4.000000 |
| 25% | 551.000000 | 35196.000000 | 0.000000 | 0.000000 | 24.000000 | 24.000000 | 2.000000 | 16.000000 | 3.000000 | 1.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 56.000000 | 42.000000 | 0.00000 | 0.000000 | 56.000000 |
| 50% | 1102.000000 | 51287.000000 | 0.000000 | 0.000000 | 49.000000 | 178.000000 | 8.000000 | 68.000000 | 12.000000 | 8.000000 | ... | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 343.000000 | 288.000000 | 0.00000 | 1.000000 | 343.000000 |
| 75% | 1653.000000 | 68281.000000 | 1.000000 | 1.000000 | 74.000000 | 507.000000 | 33.000000 | 232.000000 | 50.000000 | 34.000000 | ... | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 964.000000 | 884.000000 | 0.00000 | 1.000000 | 964.000000 |
| max | 2204.000000 | 113734.000000 | 2.000000 | 2.000000 | 99.000000 | 1493.000000 | 199.000000 | 1725.000000 | 259.000000 | 262.000000 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2491.000000 | 2458.000000 | 4.00000 | 3.000000 | 2491.000000 |
8 rows × 42 columns
1. Qual é o maior salário encontrado na nossa base? Veja na coluna Income
df.nlargest(1,'Income')
| Unnamed: 0 | Income | Kidhome | Teenhome | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | ... | education_Graduation | education_Master | education_PhD | MntTotal | MntRegularProds | AcceptedCmpOverall | marital_status | education_level | kids | expenses | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1872 | 1872 | 113734.0 | 0 | 0 | 9 | 6 | 2 | 3 | 1 | 262 | ... | 0 | 0 | 1 | 274 | 271 | 0 | Single | PhD | 0 | 274 |
1 rows × 44 columns
max_income = df['Income'].nlargest(1).values[0]
max_income
113734.0
print(f"O maior salário da base é: {max_income}")
O maior salário da base é: 113734.0
2. Qual é a distribuição de salário na nossa base? Temos clientes que ganham muito bem e outros que não ganham muito bem? Veja na coluna Income
gr2_hist_income = px.histogram(df,
x='Income',
nbins=30)
gr2_hist_income
gr2_hist_box = px.histogram(df,
x='Income',
nbins=40,
marginal='box')
gr2_hist_box.update_traces(marker_color = 'blue')
gr2_hist_box
gr2_px_box = px.box(df,
y='Income')
gr2_px_box.update_traces(marker_color='black')
gr2_px_box.update_layout(title_text = 'Distribuição de salário')
gr2_px_box
# Utilizando a biblioteca Seaborn:
sns.boxplot(df['Income'])
<Axes: >
3. Nossos clientes tem níveis de educação maiores ou menores? Veja na coluna education_level
gr3 = px.histogram(df,
x='education_level'
)
gr3.update_traces(marker_color = 'orange')
# gr3.update_traces(text = df.groupby('education_level')['education_level'].count()) NÃO RETORNA ROTULOS CORRETOS
gr3
Os clientes em sua grande maioria possuem ensino superior.
4. Quantos clientes temos em cada estado civil? Veja na coluna marital_status
gr4 = px.histogram(df,
x = 'marital_status')
gr4
A maior parte dos clientes são casados ou moram junto com um parceiro/a.
A minoria dos clientes são viúvos.
1. Qual é a relação de estado civil com número de filhos? Será que as pessoas casadas têm um maior número de filhos? Filhos está na coluna kids
gr1_kids = px.box(df,
x='marital_status',
y='kids')
gr1_kids
df.groupby('marital_status')['kids'].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| marital_status | ||||||||
| Divorced | 230.0 | 1.008696 | 0.741716 | 0.0 | 1.0 | 1.0 | 1.0 | 3.0 |
| Married | 854.0 | 0.967213 | 0.732019 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 |
| Single | 477.0 | 0.859539 | 0.777289 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 |
| Together | 568.0 | 0.980634 | 0.749823 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 |
| Widow | 76.0 | 0.881579 | 0.747745 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 |
df.groupby('marital_status')['kids'].median()
marital_status Divorced 1.0 Married 1.0 Single 1.0 Together 1.0 Widow 1.0 Name: kids, dtype: float64
Os valores retornados não são suficientes para se ter uma distinção se pessoas casadas de fato possuem mais filhos.
2. As pessoas gastam mais ou menos em nossa plataforma quando têm filhos? Veja nas colunas expenses e kids
gr2_kids = px.box(df,
x='kids',
y='expenses')
gr2_kids
df.groupby('kids')['expenses'].describe()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| kids | ||||||||
| 0 | 628.0 | 1041.207006 | 625.734894 | 5.0 | 546.00 | 1106.5 | 1505.50 | 2491.0 |
| 1 | 1112.0 | 434.530576 | 452.606603 | 4.0 | 50.00 | 264.0 | 762.25 | 2036.0 |
| 2 | 415.0 | 221.568675 | 290.259678 | 8.0 | 39.00 | 84.0 | 296.50 | 1627.0 |
| 3 | 50.0 | 237.380000 | 390.174157 | 6.0 | 28.25 | 62.0 | 254.00 | 1601.0 |
df.groupby('kids')['expenses'].median()
kids 0 1106.5 1 264.0 2 84.0 3 62.0 Name: expenses, dtype: float64
df.groupby('kids')['expenses'].mean()
kids 0 1041.207006 1 434.530576 2 221.568675 3 237.380000 Name: expenses, dtype: float64
Segundo os valores de média e mediana, as pessoas que não possuem filhos tendem a gastar mais.
A mediana e a média dos dados referente a 1,2 e 3 filhos são muito distintas, devido a presença de outliers.
3. Pessoas que têm um maior salário gastam mais? Veja nas colunas Income e expenses
px.scatter(df,
x='Income',
y='expenses')
df['expenses'].corr(df['Income'])
0.823066002139838
A correlação de pearson resultou em 0.82, logo a correlação é diretamente proporcional, pessoas que ganham mais tendem a gastar mais.